Finding and evaluating community structure in networks
We propose and study a set of algorithms for discovering community structure in networks, that is,
natural divisions of network nodes into densely connected subgroups. Our algorithms all share two
definitive features: first, they involve iterative removal of edges from the network to split it into 
communities, the edges removed being identified using one of a number of possible "betweenness" 
measures, and second, these measures are, crucially, recalculated after each removal. We also propose 
a measure for the strength of the community structure found by our algorithms, which gives us an 
objective metric for choosing the number of communities into which a network should be divided. 
We demonstrate that our algorithms are highly effective at discovering community structure in both 
computer-generated and real-world network data, and show how they can be used to shed light on 
the sometimes dauntingly complex structure of networked systems. 



I. INTRODUCTION 

Empirical studies and theoretical modeling of networks 
have been the subject of a large body of recent research in 
statistical physics and applied mathematics [1, 2, 3, 4].
Network ideas have been applied with great success to
topics as diverse as the Internet and the world wide
web [5, 6, 7], epidemiology [8, 9, 10, 11], scientific
citation and collaboration [12, 13], metabolism [14, 15],
and ecosystems [16, 17], to name but a few. A property
that seems to be common to many networks is community
structure, the division of network nodes into groups
within which the network connections are dense, but
between which they are sparser (see Fig. 1). The ability to
find and analyze such groups can provide invaluable help
in understanding and visualizing the structure of networks.
In this paper we show how this can be achieved.

The study of community structure in networks has a
long history. It is closely related to the ideas of graph
partitioning in graph theory and computer science, and
hierarchical clustering in sociology [18, 19]. Before
presenting our own findings, it is worth reviewing some of
this preceding work, to understand its achievements and
where it falls short.

FIG. 1: A small network with community structure of the
type considered in this paper. In this case there are three
communities, denoted by the dashed circles, which have dense
internal links but between which there are only a lower density
of external links.

Graph partitioning is a problem that arises in, for ex- 
ample, parallel computing. Suppose we have a num- 
ber n of intercommunicating computer processes, which 
we wish to distribute over a number g of computer proces- 
sors. Processes do not necessarily need to communicate 
with all others, and the pattern of required communica- 
tions can be represented by a graph or network in which 
the vertices represent processes and edges join process 
pairs that need to communicate. The problem is to allo- 
cate the processes to processors in such a way as roughly 
to balance the load on each processor, while at the same 
time minimizing the number of edges that run between 
processors, so that the amount of interprocessor commu- 
nication (which is normally slow) is minimized. In gen- 
eral, finding an exact solution to a partitioning task of 
this kind is believed to be an NP-complete problem, mak- 
ing it prohibitively difficult to solve for large graphs, but 
a wide variety of heuristic algorithms have been devel- 
oped that give acceptably good solutions in many cases, 
the best known being perhaps the Kernighan-Lin algorithm
[20], which runs in time O(n^3) on sparse graphs.

A solution to the graph partitioning problem is how- 
ever not particularly helpful for analyzing and under- 
standing networks in general. If we merely want to find 
if and how a given network breaks down into commu- 
nities, we probably don't know how many such com- 
munities there are going to be, and there is no reason 
why they should be roughly the same size. Furthermore, 
the number of inter-community edges needn't be strictly 
minimized either, since more such edges are admissible 
between large communities than between small ones. 

As far as our goals in this paper are concerned, a more
useful approach is that taken by social network analysis
with the set of techniques known as hierarchical clustering.
These techniques are aimed at discovering natural
divisions of (social) networks into groups, based on various
metrics of similarity or strength of connection between
vertices. They fall into two broad classes, agglomerative
and divisive [19], depending on whether they focus
on the addition or removal of edges to or from the network.
In an agglomerative method, similarities are calculated
by one method or another between vertex pairs,
and edges are then added to an initially empty network
(n vertices with no edges) starting with the vertex pairs
with highest similarity. The procedure can be halted at
any point, and the resulting components in the network
are taken to be the communities. Alternatively, the entire
progression of the algorithm from empty graph to
complete graph can be represented in the form of a tree
or dendrogram such as that shown in Fig. 2. Horizontal
cuts through the tree represent the communities appropriate
to different halting points.

FIG. 2: A hierarchical tree or dendrogram illustrating the
type of output generated by the algorithms described here.
The circles at the bottom of the figure represent the individual
vertices of the network. As we move up the tree the
vertices join together to form larger and larger communities,
as indicated by the lines, until we reach the top, where all are
joined together in a single community. Alternatively, the
dendrogram depicts an initially connected network splitting
into smaller and smaller communities as we go from top to
bottom. A cross-section of the tree at any level, as indicated
by the dotted line, will give the communities at that level.
The vertical heights of the split-points in the tree are indicative
only of the order in which the splits (or joins) took place,
although it is possible to construct more elaborate dendrograms
in which these heights contain other information.
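
The agglomerative procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `agglomerate`, the dictionary of pairwise similarities, and the halting rule (stop at a requested number of groups) are hypothetical choices made for the example, not part of any particular published method.

```python
def agglomerate(vertices, similarity, n_groups):
    """Add edges in order of decreasing similarity until n_groups remain.

    similarity maps vertex pairs (u, v) to a number (hypothetical input).
    Components are tracked with a simple union-find structure.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    groups = len(parent)
    # Highest-similarity pairs are joined first.
    for (u, v), _ in sorted(similarity.items(), key=lambda kv: -kv[1]):
        if groups <= n_groups:
            break  # halting point: the requested number of communities
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            groups -= 1

    communities = {}
    for v in vertices:
        communities.setdefault(find(v), set()).add(v)
    return sorted(communities.values(), key=min)
```

Calling the function with successively smaller values of `n_groups` corresponds to taking horizontal cuts higher and higher up the dendrogram.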

Agglomerative methods based on a wide variety of similarity
measures have been applied to different networks.
Some networks have natural similarity metrics built in.
For example, in the widely studied network of collaborations
between film actors [21, 22], in which two actors
are connected if they have appeared in the same film, one
could quantify similarity by how many films actors have
appeared in together [23]. Other networks have no natural
metric, but suitable ones can be devised using correlation
coefficients, path lengths, or matrix methods. A well
known example of an agglomerative clustering method is
the Concor algorithm of Breiger et al. [24].

Agglomerative methods have their problems however.
One concern is that they fail with some frequency to find
the correct communities in networks where the community
structure is known, which makes it difficult to place
much trust in them in other cases. Another is their tendency
to find only the cores of communities and leave
out the periphery. The core nodes in a community often
have strong similarity, and hence are connected early
in the agglomerative process, but peripheral nodes that
have no strong similarity to others tend to get neglected,
leading to structures like that shown in Fig. 3. In this
figure, there are a number of peripheral nodes whose
community membership is obvious to the eye (in most cases
they have only a single link to a specific community),
but agglomerative methods often fail to place such nodes
correctly.

FIG. 3: Agglomerative clustering methods are typically good
at discovering the strongly linked cores of communities (bold
vertices and edges) but tend to leave out peripheral vertices,
even when, as here, most of them clearly belong to one
community or another.

In this paper, therefore, we focus on divisive meth- 
ods. These methods have been relatively little studied 
in the previous literature, either in social network the- 
ory or elsewhere, but, as we will see, seem to offer a 
lot of promise. In a divisive method, we start with the 
network of interest and attempt to find the least similar 
connected pairs of vertices and then remove the edges 
between them. By doing this repeatedly, we divide the 
network into smaller and smaller components, and again 
we can stop the process at any stage and take the com- 
ponents at that stage to be the network communities. 
Again, the process can be represented as a dendrogram 
depicting the successive splits of the network into smaller 
and smaller groups. 

The approach we take follows roughly these lines, 
but adopts a somewhat different philosophical viewpoint. 
Rather than looking for the most weakly connected ver- 
tex pairs, our approach will be to look for the edges in the 
network that are most "between" other vertices, meaning 
that the edge is, in some sense, responsible for connect- 
ing many pairs of others. Such edges need not be weak 
at all in the similarity sense. How this idea works out in 
practice will become clear in the course of the presenta- 
tion. 

Briefly then, the outline of this paper is as follows.
In Sec. II we describe the crucial concepts behind our
methods for finding community structure in networks and
show how these concepts can be turned into a concrete
prescription for performing calculations. In Sec. III we
describe in detail the implementation of our methods. In
Sec. IV we consider ways of determining when a particular
division of a network into communities is a good one,
allowing us to quantify the success of our community-finding
algorithms. And in Sec. V we give a number
of applications of our algorithms to particular networks,
both real and artificial. In Sec. VI we give our conclusions.
A brief report of some of the work contained in
this paper has appeared previously as Ref. [25].



II. FINDING COMMUNITIES IN A NETWORK 

In this paper we present a class of new algorithms 
for network clustering, i.e., the discovery of community 
structure in networks. Our discussion focuses primarily 
on networks with only a single type of vertex and a single 
type of undirected, unweighted edge, although general- 
izations to more complicated network types are certainly 
possible. 

There are two central features that distinguish our al- 
gorithms from those that have preceded them. First, our 
algorithms are divisive rather than agglomerative. Di- 
visive algorithms have occasionally been studied in the 
past, but, as discussed in the introduction, ours differ 
in focusing not on removing the edges between vertex 
pairs with lowest similarity, but on finding edges with 
the highest "betweenness," where betweenness is some 
measure that favors edges that lie between communities 
and disfavors those that lie inside communities. 

To make things more concrete, we give some examples 
of the types of betweenness measures we will be looking 
at. All of them are based on the same idea. If two com- 
munities are joined by only a few inter-community edges, 
then all paths through the network from vertices in one 
community to vertices in the other must pass along one 
of those few edges. Given a suitable set of paths, one can 
count how many go along each edge in the graph, and 
this number we then expect to be largest for the inter- 
community edges, thus providing a method for identify- 
ing them. Our different measures correspond to various 
implementations of this idea. 

1. The simplest example of such a betweenness measure
is that based on shortest (geodesic) paths: we
find the shortest paths between all pairs of vertices
and count how many run along each edge. To the
best of our knowledge this measure was first introduced
by Anthonisse in a never-published technical
report in 1971 [26]. Anthonisse called it "rush,"
but we prefer the term edge betweenness, since the
quantity is a natural generalization to edges of the
well-known (vertex) betweenness measure of Freeman
[27], which was the inspiration for our approach.
When we need to distinguish it from the
other betweenness measures considered in this paper,
we will refer to it as shortest-path betweenness.
A fast algorithm for calculating the shortest-path
betweenness is given in Sec. III A.



2. The shortest-path betweenness can be thought of 
in terms of signals traveling through a network. 
If signals travel from source to destination along 
geodesic network paths, and all vertices send sig- 
nals at the same constant rate to all others, then 
the betweenness is a measure of the rate at which 
signals pass along each edge. Suppose however that 
signals do not travel along geodesic paths, but in- 
stead just perform a random walk about the net- 
work until they reach their destination. This gives 
us another measure on edges, the random-walk be- 
tweenness: we calculate the expected net number 
of times that a random walk between a particular 
pair of vertices will pass down a particular edge 
and sum over all vertex pairs. The random-walk
betweenness can be calculated using matrix methods,
as described in Sec. III C.

3. Another betweenness measure is motivated by ideas 
from elementary circuit theory. We consider the 
circuit created by placing a unit resistance on each 
edge of the network and unit current source and 
sink at a particular pair of vertices. The resulting 
current flow in the network will travel from source 
to sink along a multitude of paths, those with least 
resistance carrying the greatest fraction of the cur- 
rent. The current-flow betweenness for an edge 
we define to be the absolute value of the current 
along the edge summed over all source/sink pairs. 
It can be calculated using Kirchhoff's laws, as described
in Sec. III B. In fact, as we will show, the
current-flow betweenness turns out to be exactly 
the same as the random walk betweenness of the 
previous paragraph, but we nonetheless consider it 
separately since it leads to a simpler derivation of 
the measure. 

These measures are only suggestions; many others are 
possible and may well be appropriate for specific applica- 
tions. Measures (1) and (2) are in some sense extremes in 
the spectrum of possibilities, one corresponding to signals 
that know exactly where they are going, and the other 
to signals that have no idea where they are going. As 
we will see, however, these two measures actually give 
rather similar results, indicating that the precise choice 
of betweenness measure may not, at least for the types 
of applications considered here, be that important. 

The second way in which our methods differ from pre- 
vious ones is in the inclusion of a "recalculation step" in 
the algorithm. If we were to perform a standard divisive 
clustering based on edge betweenness we would calculate 
the edge betweenness for all edges in the network and 
then remove edges in decreasing order of betweenness to 
produce a dendrogram like that of Fig. 2, showing the
order in which the network split up. 

However, once the first edge in the network is removed 
in such an algorithm, the betweenness values for the re- 
maining edges will no longer reflect the network as it now 
is. This can give rise to unwanted behaviors. For example,
if two communities are joined by two edges, but, for
one reason or another, most paths between the two flow 
along just one of those edges, then that edge will have a 
high betweenness score and the other will not. An algorithm
that calculated betweennesses only once and then
removed edges in betweenness order would remove the 
first edge early in the course of its operation, but the 
second might not get removed until much later. Thus 
the obvious division of the network into two parts might 
not be discovered by the algorithm. In the worst case the 
two parts themselves might be individually broken up be- 
fore the division between the two is made. In practice, 
problems like this crop up in real networks with some 
regularity and render algorithms of this type ineffective 
for the discovery of community structure. 

The solution, luckily, is obvious. We simply recalcu- 
late our betweenness measure after the removal of each 
edge. This certainly adds to the computational effort of 
performing the calculation, but its effect on the results is 
so desirable that we consider the price worth paying. 

Thus the general form of our community structure find- 
ing algorithm is as follows: 

1. Calculate betweenness scores for all edges in the 
network. 

2. Find the edge with the highest score and remove it 
from the network. 

3. Recalculate betweenness for all remaining edges. 

4. Repeat from step 2. 
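
The four steps above can be sketched as follows. This is a minimal illustration, not an efficient implementation: the function names are hypothetical, the graph is assumed to be simple and undirected with comparable vertex labels, and the betweenness routine is a deliberately naive pair-by-pair count of shortest paths rather than the fast method of Sec. III A. Ties between equally scored edges are broken arbitrarily (here by vertex label), which is an assumption of this sketch.

```python
from collections import deque
from fractions import Fraction


def bfs_counts(adj, s):
    """Distances from s and number of shortest paths from s to each vertex."""
    dist, count = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                count[j] = count[i]
                queue.append(j)
            elif dist[j] == dist[i] + 1:
                count[j] += count[i]
    return dist, count


def naive_edge_betweenness(adj):
    """Pair-by-pair shortest-path edge betweenness (slow but direct)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    verts = sorted(adj)
    score = {frozenset((u, v)): Fraction(0) for u in adj for v in adj[u]}
    for a, s in enumerate(verts):
        dist_s, count_s = info[s]
        for t in verts[a + 1:]:
            if t not in dist_s:
                continue  # s and t lie in different components
            dist_t, count_t = info[t]
            for u in dist_s:
                for v in adj[u]:
                    # Does the step u -> v lie on a shortest path from s to t?
                    if dist_s[u] + 1 + dist_t[v] == dist_s[t]:
                        score[frozenset((u, v))] += Fraction(
                            count_s[u] * count_t[v], count_s[t])
    return score


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp = set(bfs_counts(adj, s)[0])
            seen |= comp
            comps.append(comp)
    return comps


def divisive_levels(edges):
    """Remove edges one at a time, recording each new split of the network."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    levels = [components(adj)]
    while any(adj.values()):
        scores = naive_edge_betweenness(adj)          # steps 1 and 3
        u, v = max(scores, key=lambda e: (scores[e], sorted(e)))  # step 2
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > len(levels[-1]):
            levels.append(comps)                       # a new dendrogram level
    return levels                                      # step 4 is the loop
```

For two triangles joined by a single edge, the joining edge carries all nine inter-triangle shortest paths, so it is removed first and the second level of the returned list consists of the two triangles.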

In fact, it appears that the recalculation step is the 
most important feature of the algorithm, as far as getting 
satisfactory results is concerned. As mentioned above, 
our studies indicate that, once one hits on the idea of us- 
ing betweenness measures to weight edges, the exact mea- 
sure one uses appears not to influence the results highly. 
The recalculation step, on the other hand, is absolutely 
crucial to the operation of our methods. This step was 
missing from previous attempts at solving the cluster- 
ing problem using divisive algorithms, and yet without 
it the results are very poor indeed, failing to find known 
community structure even in the simplest of cases. In 
Sec. V B we give an example comparing the performance
of the algorithm on a particular network with and with- 
out the recalculation step. 

In the following sections we discuss implementation 
and give examples of our algorithms for finding commu- 
nity structure. For the reader who merely wants to know 
what algorithm they should use for their own problem, 
let us give an immediate answer: for most problems, we 
recommend the algorithm with betweenness scores cal- 
culated using the shortest-path betweenness measure (1) 
above. This measure appears to work well and is the 
quickest to calculate: as described in Sec. III A, it can
be calculated for all edges in time O(mn), where m is 
the number of edges in the graph and n is the number 
of vertices. This is the only version of the algorithm
that we discussed in Ref. [25]. The other versions we
discuss, while being of some pedagogical interest, make 
greater computational demands, and in practice seem to 
give results no better than the shortest-path method. 



III. IMPLEMENTATION 

In theory, the descriptions of the last section com- 
pletely define the methods we consider in this paper, but 
in practice there are a number of tricks to their imple- 
mentation that are important for turning the description 
into a workable computer algorithm. 

Essentially all of the work in the algorithm is in the 
calculation of the betweenness scores for the edges; the 
job of finding and removing the highest-scoring edge is 
trivial and not computationally demanding. Let us tackle 
our three suggested betweenness measures in turn. 



A. Shortest-path betweenness 

At first sight, it appears that calculating the edge
betweenness measure based on geodesic paths for all edges
will take O(mn^2) operations on a graph with m edges
and n vertices: calculating the shortest path between a
particular pair of vertices can be done using breadth-first
search in time O(m) [28, 29], and there are O(n^2) vertex
pairs. Recently however new algorithms have been
proposed by Newman [30] and independently by Brandes
[31] that can perform the calculation faster than this,
finding all betweennesses in O(mn) time. Both Newman
and Brandes gave algorithms for the standard Freeman
vertex betweenness, but it is trivial to adapt their algorithms
for edge betweenness. We describe the resulting
method here for the algorithm of Newman.

Breadth-first search can find shortest paths from a single
vertex s to all others in time O(m). In the simplest
case, when there is only a single shortest path from the
source vertex to any other (we will consider other cases
in a moment) the resulting set of paths forms a shortest-path
tree (see Fig. 4a). We can now use this tree to calculate
the contribution to betweenness for each edge from
this set of paths as follows. We find first the "leaves" of
the tree, i.e., those nodes such that no shortest paths to
other nodes pass through them, and we assign a score of 1
to the single edge that connects each to the rest of the
tree, as shown in the figure. Then, starting with those
edges that are farthest from the source vertex on the tree,
i.e., lowest in Fig. 4a, we work upwards, assigning a score
to each edge that is 1 plus the sum of the scores on the
neighboring edges immediately below it. When we have
gone through all edges in the tree, the resulting scores
are the betweenness counts for the paths from vertex s.
Repeating the process for all possible vertices s and summing
the scores, we arrive at the full betweenness scores
for shortest paths between all pairs. The breadth-first
search and the process of working up through the tree
both take worst-case time O(m) and there are n vertices
total, so the entire calculation takes time O(mn) as
claimed.

FIG. 4: Calculation of shortest-path betweenness: (a) When
there is only a single shortest path from a source vertex s
(top) to all other reachable vertices, those paths necessarily
form a tree, which makes the calculation of the contribution
to betweenness from this set of paths particularly simple, as
described in the text. (b) For cases in which there is more than
one shortest path to some vertices, the calculation is more
complex. First we must calculate the number of paths from
the source to each other vertex (numbers on vertices), and
then these are used to weight the path counts appropriately.
In either case, we can check the results by confirming that the
sum of the betweennesses of the edges connected to the source
vertex is equal to the total number of reachable vertices (six
in each of the cases illustrated here).
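
As a rough illustration of the single-path case, the sketch below scores the edges of a shortest-path tree from one source. It assumes that every vertex has exactly one shortest path to the source (for instance, the graph is itself a tree); the function name and the adjacency-list input format are hypothetical choices made for this example.

```python
from collections import deque


def tree_edge_scores(adj, s):
    """Scores on the shortest-path tree rooted at s (unique paths assumed)."""
    parent = {s: None}
    order = []                      # vertices in order of increasing distance
    queue = deque([s])
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in adj[i]:
            if j not in parent:
                parent[j] = i
                queue.append(j)

    below = {v: 0 for v in order}   # sum of scores on the edges below v
    score = {}
    # Work upwards from the vertices farthest from the source.
    for v in reversed(order):
        p = parent[v]
        if p is not None:
            score[(p, v)] = 1 + below[v]   # leaves get 1, since below[v] == 0
            below[p] += score[(p, v)]
    return score
```

The check mentioned in the caption of Fig. 4 applies here as well: the scores on the edges attached to the source should add up to the number of other reachable vertices.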

This simple case serves to illustrate the basic principle 
behind the algorithm. In general, however, it is not the 
case that there is only a single shortest path between any 
pair of vertices. Most networks have at least some vertex 
pairs between which there are several geodesic paths of 
equal length. Figure 4b shows a simple example of a
shortest path "tree" for a network with this property. 
The resulting structure is in fact no longer a tree, and in 
such cases an extra step is required in the algorithm to 
calculate the betweenness correctly. 

In the traditional definition of vertex betweenness [27]
multiple shortest paths between a pair of vertices are
given equal weights summing to 1. For example, if there
are three shortest paths, each will be given weight 1/3.
We adopt the same definition for our edge betweenness
(as did Anthonisse in his original work [26], although
other definitions are possible [23]). Note that the paths
may run along the same edge or edges for some part of 
their length, resulting in edges with greater weight. To 
calculate correctly what fraction of the paths flow along 
each edge in the network, we generalize the breadth-first 
search part of the calculation, as follows. 

Consider Fig. 4b and suppose we are performing a
breadth-first search starting at vertex s. We carry out 
the following steps: 

1. The initial vertex s is given distance d_s = 0 and a
weight w_s = 1.

2. Every vertex i adjacent to s is given distance d_i =
d_s + 1 = 1, and weight w_i = w_s = 1.

3. For each vertex j adjacent to one of those vertices i
we do one of three things:

(a) If j has not yet been assigned a distance, it
is assigned distance d_j = d_i + 1 and weight
w_j = w_i.

(b) If j has already been assigned a distance and
d_j = d_i + 1, then the vertex's weight is increased
by w_i, that is, w_j <- w_j + w_i.

(c) If j has already been assigned a distance and
d_j < d_i + 1, we do nothing.

4. Repeat from step 3 until no vertices remain that
have assigned distances but whose neighbors do not
have assigned distances.

In practice, this algorithm can be implemented most ef- 
ficiently using a queue or first-in/first-out buffer to store 
the vertices that have been assigned a distance, just as 
in the standard breadth-first search. 
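
A minimal sketch of this modified breadth-first search, using a queue as suggested, might look as follows. The function name and the adjacency-list input are assumptions made for illustration; the comments map each branch to the numbered steps above.

```python
from collections import deque


def shortest_path_weights(adj, s):
    """Return (distance, weight) dictionaries for all vertices reachable from s.

    weight[v] is the number of distinct shortest paths from s to v.
    """
    dist = {s: 0}        # step 1: the source has distance 0 ...
    weight = {s: 1}      # ... and weight 1
    queue = deque([s])
    while queue:                              # step 4: repeat until done
        i = queue.popleft()
        for j in adj[i]:                      # steps 2 and 3
            if j not in dist:                 # (a) j not yet assigned
                dist[j] = dist[i] + 1
                weight[j] = weight[i]
                queue.append(j)
            elif dist[j] == dist[i] + 1:      # (b) another shortest path to j
                weight[j] += weight[i]
            # (c) dist[j] < dist[i] + 1: do nothing
    return dist, weight
```

Because vertices leave the queue in order of increasing distance, the weight of a vertex is complete by the time it is dequeued, so the weights copied or added in cases (a) and (b) are final.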

Physically, the weight on a vertex i represents the number
of distinct shortest paths from the source vertex to i. These
weights are precisely what we need to calculate our edge
betweennesses, because if two vertices i and j are connected,
with j farther than i from the source s, then the
fraction of the geodesic paths from j to s that pass
through i is given by w_i/w_j. Thus, to calculate the
contribution to edge betweenness from all shortest paths
starting at s, we need only carry out the following steps:

1. Find every "leaf" vertex t, i.e., a vertex such that
no paths from s to other vertices go through t.

2. For each vertex i neighboring t assign a score to the
edge from t to i of w_i/w_t.

3. Now, starting with the edges that are farthest from
the source vertex s (lower down in a diagram such
as Fig. 4b), work up towards s. To the edge from
vertex i to vertex j, with j being farther from s
than i, assign a score that is 1 plus the sum of
the scores on the neighboring edges immediately
below it (i.e., those with which it shares a common
vertex), all multiplied by w_i/w_j.

4. Repeat from step 3 until vertex s is reached.

Now repeating this process for all n source vertices s and 
summing the resulting scores on the edges gives us the 
total betweenness for all edges in time O(mn).
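
The whole calculation can be sketched as below. The names are hypothetical and the code is an illustration rather than a reference implementation. Instead of locating the leaves explicitly, it processes vertices in reverse breadth-first order, which treats leaves first because nothing lies below them; exact fractions are used so that the weights w_i/w_j are not subject to rounding. Summing over all sources counts every vertex pair from both of its endpoints, so the totals are twice the per-pair counts, which does not affect the ranking of edges.

```python
from collections import deque
from fractions import Fraction


def scores_from_source(adj, s):
    """Edge scores contributed by all shortest paths starting at s."""
    dist, weight, order = {s: 0}, {s: 1}, []
    queue = deque([s])
    while queue:                       # breadth-first search with path counts
        i = queue.popleft()
        order.append(i)
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                weight[j] = weight[i]
                queue.append(j)
            elif dist[j] == dist[i] + 1:
                weight[j] += weight[i]

    below = {v: Fraction(0) for v in order}   # scores on edges below each vertex
    score = {}
    for j in reversed(order):                 # farthest vertices first
        for i in adj[j]:
            if dist[i] + 1 == dist[j]:        # i is one step closer to s
                c = (1 + below[j]) * Fraction(weight[i], weight[j])
                score[frozenset((i, j))] = c
                below[i] += c
    return score


def edge_betweenness(adj):
    """Sum the single-source scores over all n source vertices."""
    total = {}
    for s in adj:
        for edge, c in scores_from_source(adj, s).items():
            total[edge] = total.get(edge, 0) + c
    return total
```

As a sanity check, the scores on the edges attached to the source should sum to the number of other reachable vertices, as noted in the caption of Fig. 4.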

We now have to repeat this calculation for each edge
removed from the network, of which there are m, and
hence the complete community structure algorithm based
on shortest-path betweenness operates in worst-case time
O(m^2 n), or O(n^3) time on a sparse graph. In our experience,
this typically makes it tractable for networks of up
to about n = 10 000 vertices, with current (circa 2003)
desktop computers. In some special cases one can do 
better. In particular, we note that the removal of an 
edge only affects the betweenness of other edges that fall 
in the same component, and hence that we need only 
recalculate betweennesses in that component. Networks 
with strong community structure often break apart into 
separate components quite early in the progress of the al- 
gorithm, substantially reducing the amount of work that 
needs to be done on subsequent steps. Whether this re- 
sults in a change in the computational complexity of the 
algorithm for any commonly occurring classes of graphs 
is an open question, but it certainly gives a substantial 
speed boost to many of the calculations described in this 
paper. 
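
The component-based shortcut can be illustrated with a small helper that removes an edge and reports which vertices need their edge scores recomputed. The function names are hypothetical and the graph is assumed to be stored as a dictionary of neighbor sets; scores in every other component can simply be kept from the previous step.

```python
def component_of(adj, start):
    """Set of vertices reachable from start (iterative depth-first search)."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def remove_edge_and_get_dirty(adj, u, v):
    """Remove edge (u, v) in place and return the affected vertex sets.

    One set is returned if u and v are still connected after the removal,
    two if the removal split their component.
    """
    adj[u].discard(v)
    adj[v].discard(u)
    part_u = component_of(adj, u)
    if v in part_u:
        return [part_u]
    return [part_u, component_of(adj, v)]
```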

Some networks are directed, i.e., their edges run in 
one direction only. The world wide web is an example; 
links in the web point in one direction only from one web 
page to another. One could imagine a generalization of 
the shortest-path betweenness that allowed for directed 
edges by counting only those paths that travel in the 
forward direction along edges. Such a calculation is a 
trivial variation on the one described above. However, 
we have found that in many cases it is better to ignore 
the directed nature of a network in calculating commu- 
nity structure. Often an edge acts simply as an indication 
of a connection between two nodes, and its direction is 
unimportant. For example, in Ref. [25] we applied our
algorithm to a food web of predator-prey interactions 
between marine species. Predator-prey interactions are 
clearly directed — one species may eat another, but it is 
unlikely that the reverse is simultaneously true. However, 
as far as community structure is concerned, we want to 
know only which species have interactions with which 
others. We find, therefore, that our algorithm applied 
to the undirected version of the food web works well at 
picking out the community structure, and no special al- 
gorithm is needed for the directed case. We give another 
example of our method applied to a directed graph in
Sec. V D.



B. Resistor networks 

As examples of betweenness measures that take more 
than just shortest paths into account, we proposed in
Sec. II measures based on random walks and resistor networks.
In fact, as we now show, when appropriately defined
these two measures are precisely the same. Here we
derive the resistance measure first, since it turns out to 
be simpler; in the following section we derive the random 
walk measure and show that the two are equivalent. 

Consider the network created by placing a unit resistance
on every edge of our network, a unit current source
at vertex s, and a unit current sink at vertex t (see Fig. 5).
Clearly the current between s and t will flow primarily
along short paths, but some will flow along longer ones,
roughly in inverse proportion to their length. We will
use the absolute magnitude of the current flow as our
betweenness score for each source/sink pair.

FIG. 5: An example of the type of resistor network considered
here, in which a unit resistance is placed on each edge and unit
current flows into and out of the source and sink vertices.

The current flows in the network are governed by
Kirchhoff's laws. To solve them we proceed as follows
for each separate component of the graph. Let V_i be the
voltage at vertex i, measured relative to any convenient
point. Then for all i we have

    \sum_j A_{ij} (V_i - V_j) = \delta_{is} - \delta_{it},    (1)

where A_{ij} is the ij element of the adjacency matrix of
the graph, i.e., A_{ij} = 1 if i and j are connected by an
edge and A_{ij} = 0 otherwise. The left-hand side of Eq. (1)
represents the net current flow out of vertex i along edges
of the network, and the right-hand side represents the
source and sink. Defining k_i = \sum_j A_{ij}, which is the
vertex degree, and creating a diagonal matrix D with
these degrees on the diagonal, D_{ii} = k_i, this equation can
be written in matrix form as (D - A) · V = s, where the
source vector s has components

    s_i = \begin{cases} +1 & \text{for } i = s, \\ -1 & \text{for } i = t, \\ 0 & \text{otherwise.} \end{cases}    (2)

We cannot directly invert the matrix D — A to get the 
voltage vector V, because the matrix (which is just the 
graph Laplacian) is singular. This is equivalent to saying 
that there is one undetermined degree of freedom corre- 
sponding to the choice of reference potential for measur- 
ing the voltages. We can add any constant to a solution 
for the vertex voltages and get another solution — only 
the voltage differences matter. In choosing the reference 
potential, we fix this degree of freedom, leaving only n - 1
more to be determined. In mathematical terms, once any
n - 1 of the equations in our matrix formulation are satisfied,
the remaining one is also automatically satisfied so
long as current is conserved in the network as a whole,
i.e., so long as \sum_i s_i = 0, which is clearly true in this
case.

Choosing any vertex v to be the reference point, therefore,
we remove the row and column corresponding to
that vertex from D and A before inverting. Denoting
the resulting (n - 1) × (n - 1) matrices D_v and A_v, we
can then write

    V = (D_v - A_v)^{-1} \cdot s.    (3)

Calculation of the currents in the network thus involves 
inverting $\mathbf{D}_v - \mathbf{A}_v$ once for any convenient choice of $v$,
and taking the differences of pairs of columns to get the 
voltage vector V for each possible source/sink pair. (The 
voltage for the one missing vertex v is always zero, by 
hypothesis.) The absolute magnitudes of the differences 
of voltages along each edge give us betweenness scores 
for the given source and sink. Summing over all sources 
and sinks, we then get our complete betweenness score. 
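
The calculation just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the results reported here: it assumes a connected, undirected graph with vertices labeled 0 to n-1, the function names are hypothetical, and the reduced Laplacian is inverted with a plain Gauss-Jordan elimination so that no external library is needed.

```python
from itertools import combinations


def invert(mat):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(mat)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (is the graph connected?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def current_flow_betweenness(n, edges, ref=0):
    """Sum of |V_i - V_j| on each edge over all unordered source/sink pairs."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    keep = [v for v in range(n) if v != ref]
    # Reduced Laplacian D_v - A_v: row and column of the reference vertex removed.
    lap = [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in keep] for i in keep]
    inv = invert(lap)  # inverted once, reused for every source/sink pair
    pos = {v: k for k, v in enumerate(keep)}

    def column(u):
        # Column of the inverse belonging to vertex u; the reference vertex
        # has voltage zero, so its own column is all zeros.
        col = [0.0] * n
        if u != ref:
            for v in keep:
                col[v] = inv[pos[v]][pos[u]]
        return col

    cols = [column(u) for u in range(n)]
    score = {e: 0.0 for e in edges}
    for s, t in combinations(range(n), 2):
        # Voltages for unit source at s and unit sink at t: difference of two columns.
        volt = [cols[s][v] - cols[t][v] for v in range(n)]
        for i, j in edges:
            score[(i, j)] += abs(volt[i] - volt[j])
    return score


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
score = current_flow_betweenness(6, edges)
print(max(score, key=score.get))  # the bridge (2, 3) has the highest score
```

In this small example the bridge carries the full unit current for each of the nine source/sink pairs that lie on opposite sides of it, which is why it receives the largest score.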

The matrix inversion takes time $O(n^3)$ in the worst
case, while the subsequent calculation of betweennesses
takes time $O(mn^2)$, where as before $m$ is the number of
edges and $n$ the number of vertices in the graph. Thus,
the entire community structure algorithm, including the
recalculation step, will take $O((n+m)mn^2)$ time to complete,
or $O(n^4)$ on a sparse graph. Although, as we will
see, the algorithm is good at finding community struc- 
ture, this poor performance makes it practical only for 
smaller graphs; a few hundreds of vertices is the most 
that we have been able to do. It is for this reason that 
we recommend using the shortest-path betweenness al- 
gorithm in most cases, which gives results about as good 
or better with considerably less effort. 

C. Random walks 

The random-walk betweenness described in Sec. II requires
us to calculate how often on average random walks
starting at vertex s will pass down a particular edge from 
vertex v to vertex w (or vice versa) before finding their 
way to a given target vertex t. To calculate this quantity 
we proceed as follows for each separate component of the 
graph. 

As before, let $A_{ij}$ be an element of the adjacency matrix
such that $A_{ij} = 1$ if vertices $i$ and $j$ are connected by
an edge and $A_{ij} = 0$ otherwise. Consider a random walk
that on each step decides uniformly between the neighbors
of the current vertex $j$ and takes a step to one of
them. The number of neighbors is just the degree of the
vertex $k_j = \sum_i A_{ij}$, and the probability for the transition
from $j$ to $i$ is $A_{ij}/k_j$, which we can regard as an element
of the matrix $\mathbf{M} = \mathbf{A} \cdot \mathbf{D}^{-1}$, where $\mathbf{D}$ is the diagonal
matrix with $D_{ii} = k_i$.

We are interested in walks that terminate when they 
reach the target t, so that t is an absorbing state. The 
most convenient way to represent this is just to remove 
entirely the vertex t from the graph, so that no walk ever 
reaches any other vertex from $t$. Thus let $\mathbf{M}_t = \mathbf{A}_t \cdot \mathbf{D}_t^{-1}$
be the matrix $\mathbf{M}$ with the $t$th row and column removed
(and similarly for $\mathbf{A}_t$ and $\mathbf{D}_t$).

Now the probability that a walk starts at $s$, takes $n$
steps, and ends up at some other vertex (not $t$), is given
by the $is$ element of $\mathbf{M}_t^n$, which we denote $[\mathbf{M}_t^n]_{is}$. In
particular, walks end up at $v$ and $w$ with probabilities
$[\mathbf{M}_t^n]_{vs}$ and $[\mathbf{M}_t^n]_{ws}$, and of those a fraction $1/k_v$ and
$1/k_w$ respectively then pass along the edge $(v, w)$ in one
direction or the other. (Note that they may also have
passed along this edge an arbitrary number of times
before reaching this point.) Summing over all $n$, the mean
number of times that a walk of any length traverses the
edge from $v$ to $w$ is $k_v^{-1} [(\mathbf{I} - \mathbf{M}_t)^{-1}]_{vs}$, and similarly for
walks that go from $w$ to $v$.

To highlight the similarity with the current-flow
betweenness of Sec. III B, let us denote these two numbers
$V_v$ and $V_w$ respectively. Then we can write

$$\mathbf{V} = \mathbf{D}_t^{-1} \cdot (\mathbf{I} - \mathbf{M}_t)^{-1} \cdot \mathbf{s} = (\mathbf{D}_t - \mathbf{A}_t)^{-1} \cdot \mathbf{s}, \tag{4}$$

where the source vector $\mathbf{s}$ is the vector whose components
are all 0 except for a single 1 in the position corresponding
to the source vertex $s$.
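
The second equality in Eq. (4) is stated without intermediate steps. One way to see it, assuming only the definition $\mathbf{M}_t = \mathbf{A}_t \cdot \mathbf{D}_t^{-1}$ given above and the identity $(\mathbf{X}\mathbf{Y})^{-1} = \mathbf{Y}^{-1}\mathbf{X}^{-1}$ for invertible matrices, is the following short derivation:

```latex
\begin{align*}
\mathbf{D}_t^{-1} \cdot (\mathbf{I} - \mathbf{M}_t)^{-1}
  &= \mathbf{D}_t^{-1} \cdot (\mathbf{I} - \mathbf{A}_t \cdot \mathbf{D}_t^{-1})^{-1} \\
  &= \bigl[ (\mathbf{I} - \mathbf{A}_t \cdot \mathbf{D}_t^{-1}) \cdot \mathbf{D}_t \bigr]^{-1} \\
  &= (\mathbf{D}_t - \mathbf{A}_t)^{-1} .
\end{align*}
```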

Now we define our random-walk betweenness for the 
edge (v, w) to be the absolute value of the difference of 
the two probabilities $V_v$ and $V_w$, i.e., the net number of
times the walk passes along the edge in one direction. 
This seems a natural definition — it makes little sense to 
accord an edge high betweenness simply because a walk 
went back and forth along it many times. It is the differ- 
ence between the numbers of times the edge is traversed 
in either direction that matters psf. 

But now we see that this method is very similar to the
resistor network calculation of Sec. III B. In that calculation
we also evaluated $(\mathbf{D}_t - \mathbf{A}_t)^{-1} \cdot \mathbf{s}$ for a suitable
source vector and then took differences of the resulting 
numbers. The only difference is that in the current-flow 
calculation we had a sink term in s as well as a source. 
Purely for the purposes of mathematical convenience, we 
can add such a sink in the present case at the target ver- 
tex t — this makes no difference to the solution for V since 
the tth row has been removed from the equations anyway. 
By doing this, however, we turn the equations into pre- 
cisely the form of the current-flow calculation, and hence 
it becomes clear that the two measures are numerically 
identical, although their derivation is quite different. (It 
also immediately follows that we can remove any row or 
column and still get the same answer — it doesn't have 
to be row and column t, although physically this choice 
makes the most sense.) 

IV. QUANTIFYING THE STRENGTH OF 
COMMUNITY STRUCTURE 

As we show in Sec. V, our community structure algorithms
do an excellent job of recovering known communities
both in artificially generated random networks and in
real-world examples. However, in practical situations the 
algorithms will normally be used on networks for which 
the communities are not known ahead of time. This 
raises a new problem: how do we know when the communities
found by the algorithm are good ones? Our algorithms
always produce some division of the network into
communities, even in completely random networks that 
have no meaningful community structure, so it would be 
useful to have some way of saying how good the struc- 
ture found is. Furthermore, the algorithms' output is 
in the form of a dendrogram which represents an entire 
nested hierarchy of possible community divisions for the 
network. We would like to know which of these divisions 
are the best ones for a given network — where we should 
cut the dendrogram to get a sensible division of the net- 
work. 

To answer these questions we now define a measure of 
the quality of a particular division of a network, which 
we call the modularity. This measure is based on a previous
measure of assortative mixing proposed by Newman [33].
Consider a particular division of a network
into $k$ communities. Let us define a $k \times k$ symmetric matrix
$\mathbf{e}$ whose element $e_{ij}$ is the fraction of all edges in the
network that link vertices in community $i$ to vertices in
community j 0] . (Here we consider all edges in the orig- 
inal network — even after edges have been removed by the 
community structure algorithm our modularity measure 
is calculated using the full network.) 

The trace of this matrix $\mathrm{Tr}\,\mathbf{e} = \sum_i e_{ii}$ gives the fraction
of edges in the network that connect vertices in the same 
community, and clearly a good division into communities 
should have a high value of this trace. The trace on its 
own, however, is not a good indicator of the quality of the 
division since, for example, placing all vertices in a single 
community would give the maximal value of $\mathrm{Tr}\,\mathbf{e} = 1$
while giving no information about community structure 
at all. 

So we further define the row (or column) sums
$a_i = \sum_j e_{ij}$, which represent the fraction of edges that connect
to vertices in community $i$. In a network in which edges
fall between vertices without regard for the communities
they belong to, we would have $e_{ij} = a_i a_j$. Thus we can
define a modularity measure by

$$Q = \sum_i \left( e_{ii} - a_i^2 \right) = \mathrm{Tr}\,\mathbf{e} - \|\mathbf{e}^2\|, \tag{5}$$

where $\|\mathbf{x}\|$ indicates the sum of the elements of the matrix
$\mathbf{x}$. This quantity measures the fraction of the edges
in the network that connect vertices of the same type
(i.e., within-community edges) minus the expected value
of the same quantity in a network with the same community
divisions but random connections between the vertices.
If the number of within-community edges is no better
than random, we will get $Q = 0$. Values approaching
$Q = 1$, which is the maximum, indicate strong community
structure [Hof. In practice, values for such networks
typically fall in the range from about 0.3 to 0.7. Higher 
values are rare. 

The expected error on Q can be calculated by treating 
each edge in the network as an independent measurement 
of the contributions to the elements of the matrix e. A 
simple jackknife procedure works well [33, 34].
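
To make the definition concrete, here is a minimal Python sketch of Eq. (5) together with a leave-one-edge-out jackknife estimate of the error on Q. The function names and the input format (an edge list plus a vertex-to-community dictionary) are assumptions made for this illustration, and the jackknife shown is the standard textbook form; the text above does not spell out the exact procedure used, so this should be read as one plausible reading rather than the definitive one.

```python
import math


def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected edge list.

    `community` maps each vertex to a community label.
    """
    m = len(edges)
    e_in = {}  # e_ii: fraction of edges inside community i
    a = {}     # a_i: fraction of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        # Each end of an edge contributes 1/(2m) to the row sum of its community.
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
    return sum(e_in.get(c, 0.0) - a[c] ** 2 for c in a)


def jackknife_error(edges, community):
    """Jackknife standard error on Q, treating each edge as one measurement."""
    m = len(edges)
    # Recompute Q with each edge left out in turn.
    left_out = [modularity(edges[:k] + edges[k + 1:], community) for k in range(m)]
    mean = sum(left_out) / m
    return math.sqrt((m - 1) / m * sum((q - mean) ** 2 for q in left_out))


# Two triangles joined by one edge, split into the two triangles.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(edges, split))       # 6/7 - 1/2 = 5/14
print(jackknife_error(edges, split))
```

For this toy graph six of the seven edges fall within communities, so $\mathrm{Tr}\,\mathbf{e} = 6/7$, while each community holds half of all edge ends, giving $a_i = 1/2$ and hence $Q = 6/7 - 1/2 = 5/14$. Placing all vertices in a single community gives $Q = 0$, as the text requires.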

Typically, we will calculate Q for each split of a network
into communities as we move down the dendrogram,
and look for local peaks in its value, which indicate
particularly satisfactory splits. Usually we find that there
are only one or two such peaks and, as we will show in 
the next section, in cases where the community struc- 
ture is known beforehand by some means we find that 
the positions of these peaks correspond closely to the ex- 
pected divisions. The height of a peak is a measure of 
the strength of the community division. 
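
The peak search can be sketched very simply. In the following illustrative Python fragment the list of Q values is made up for the example (it is not data from this paper), and `local_peaks` is a hypothetical helper name; entry k of the list is assumed to hold the modularity of the k-th division encountered while moving down the dendrogram.

```python
def local_peaks(q_values):
    """Indices at which Q is strictly larger than each neighbouring value."""
    peaks = []
    for k, q in enumerate(q_values):
        left_ok = k == 0 or q > q_values[k - 1]
        right_ok = k == len(q_values) - 1 or q > q_values[k + 1]
        if left_ok and right_ok:
            peaks.append(k)
    return peaks


# Hypothetical modularity values for successive splits of a dendrogram.
q_by_level = [0.0, 0.36, 0.30, 0.33, 0.10]
peaks = local_peaks(q_by_level)
best = max(peaks, key=lambda k: q_by_level[k])
print(peaks, best)
```

Each returned index marks a candidate place to cut the dendrogram, and the height of the corresponding value measures the strength of that division.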



V. APPLICATIONS 

In this section we give a number of applications of our 
algorithms to particular problems, illustrating their op- 
eration, and their use in understanding the structure of 
complex networks. 



A. Tests on computer-generated networks 

First, as a controlled test of how well our algorithms 
perform, we have generated networks with known com- 
munity structure, to see if the algorithms can recognize 
and extract this structure. 

We have generated a large number of graphs with 
n = 128 vertices, divided into four communities of 32 
vertices each. Edges were placed independently at random
between vertex pairs with probability $p_{\mathrm{in}}$ for an edge
to fall between vertices in the same community and $p_{\mathrm{out}}$
to fall between vertices in different communities. The
values of $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ were chosen to make the expected
degree of each vertex equal to 16. In Fig. 6 we show
a typical dendrogram from the analysis of such a graph
using the shortest-path betweenness version of our algorithm.
(In fact, for the sake of clarity, the figure is for
a 64-node version of the graph.) Results for the random
walk version are similar. At the top of the figure we also
show the modularity, Eq. (5), for the same calculation,
plotted as a function of position in the dendrogram. That 
is, the graph is aligned with the dendrogram so that one 
can read off modularity values for different divisions of 
the network directly. As we can see, the modularity has a 
single clear peak at the point where the network breaks 
into four communities, as we would expect. The peak 
value is around 0.5, which is typical. 
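
Test graphs of this kind are easy to generate. The sketch below is an illustration under stated assumptions rather than the generator actually used: the function name and parameters are hypothetical, and the probabilities are fixed by requiring that each vertex has on average `z_out` edges to the 96 vertices outside its community and `16 - z_out` edges to the 31 other vertices inside it, which is one way to satisfy the expected-degree condition given above.

```python
import random


def planted_partition(z_out, n=128, groups=4, mean_degree=16, seed=None):
    """Random graph with `groups` equal communities and expected degree `mean_degree`."""
    rng = random.Random(seed)
    size = n // groups                   # 32 vertices per community
    z_in = mean_degree - z_out           # mean number of within-community edges
    p_in = z_in / (size - 1)             # 31 possible partners inside
    p_out = z_out / (n - size)           # 96 possible partners outside
    community = {v: v // size for v in range(n)}
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community


edges, community = planted_partition(z_out=6, seed=1)
print(2 * len(edges) / len(community))   # mean degree, close to 16
```

Sweeping `z_out` from 0 to 8 then moves the graphs from perfectly separated communities to the point where a vertex has as many connections outside its community as inside.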

In Fig. 7 we show the fraction of vertices in our
computer-generated network sample classified correctly
into the four communities by our algorithms, as a function
of the mean number $z_{\mathrm{out}}$ of edges from each vertex
to vertices in other communities. As the figure shows,
both the shortest-path and random-walk versions of the
algorithm perform excellently, with more than 90% of all
vertices classified correctly from $z_{\mathrm{out}} = 0$ all the way to
around $z_{\mathrm{out}} = 6$. Only for $z_{\mathrm{out}} \gtrsim 6$ does the classification
begin to deteriorate markedly. In other words, our
algorithm correctly identifies the community structure in
the network almost all the way to the point $z_{\mathrm{out}} = 8$ at
which each vertex has on average the same number of
connections to vertices outside its community as it does
to those inside.

FIG. 6: Plot of the modularity and dendrogram for a 64-vertex
random community-structured graph generated as described
in the text with, in this case, $z_{\mathrm{in}} = 6$ and $z_{\mathrm{out}} = 2$.
The shapes on the right denote the four communities in the
graph and as we can see, the peak in the modularity (dotted
line) corresponds to a perfect identification of the communities.

FIG. 7: The fraction of vertices correctly identified by our
algorithms in the computer-generated graphs described in the
text. The two curves show results for the edge betweenness
(circles) and random walk (squares) versions of the algorithm
as a function of the number of edges vertices have to others
outside their own community. The point $z_{\mathrm{out}} = 8$ at the
rightmost edge of the plot represents the point at which the
graphs, in this example, have as many connections outside
their own community as inside it. Each point is an average
over 100 graphs.

The shortest-path version of the algorithm does how- 
ever perform noticeably better than the random-walk 
version, especially for the more difficult cases where z out 
is large. Given that the random-walk algorithm is also 
more computationally demanding, there seems little rea- 
son to use it rather than the shortest-path algorithm, and 
hence, as discussed previously, we recommend the latter 
for most applications. (To be fair, the random-walk al- 
gorithm does slightly out-perform the shortest-path algo- 
rithm in the example addressed in the following section, 
although, being only a single case, it is hard to know 
whether this is significant.) 



B. Zachary's karate club network 

We now turn to applications of our methods to real-world
network data. Our first such example is taken
from one of the classic studies in social network analysis.
Over the course of two years in the early 1970s,
Wayne Zachary observed social interactions between the
members of a karate club at an American university [35].
He constructed networks of ties between members of the
club based on their social interactions both within the
club and away from it. By chance, a dispute arose during
the course of his study between the club's administrator
and its principal karate teacher over whether to
raise club fees, and as a result the club eventually split
in two, forming two smaller clubs, centered around the
administrator and the teacher.

FIG. 8: The network of friendships between individuals in
the karate club study of Zachary [35]. The administrator and
the instructor are represented by nodes 1 and 33 respectively.
Shaded squares represent individuals who ended up aligning
with the club's administrator after the fission of the club,
open circles those who aligned with the instructor.

In Fig. 8 we show a consensus network structure extracted
from Zachary's observations before the split.
Feeding this network into our algorithms we find the results
shown in Fig. 9. In the left-most two panels we
show the dendrograms generated by the shortest-path 
and random-walk versions of our algorithm, along with 
the modularity measures for the same. As we see, both 
algorithms give reasonably high values for the modularity 
when the network is split into two communities — around 
0.4 in each case — indicating that there is a strong nat- 
ural division at this level. What's more, the divisions 
in question correspond almost perfectly to the actual di- 
visions in the club revealed by which group each club 
member joined after the club split up. (The shapes of 
the vertices representing the two factions are the same as 
those of Fig. 8.) Only one vertex, vertex 3, is misclassified
by the shortest-path version of the method, and none
are misclassified by the random-walk version — the latter 
gets a perfect score on this test. (On the other hand, the 
two-community split fails to produce a local maximum in 
the modularity for the random-walk method, unlike the 
shortest-path method for which there is a local maximum 
precisely at this point.) 

In the last panel of Fig. 9 we show the dendrogram
and modularity for an algorithm based on shortest-path
betweenness but without the crucial recalculation step
discussed in Sec. II. As the figure shows, without this
step, the algorithm fails to find the division of the network
into the two known groups. Furthermore, the modularity
doesn't reach nearly such high values as in the
first two panels, indicating that the divisions suggested
are much poorer than in the cases with the recalculation.



C. Collaboration network 

For our next example, we look at a collaboration network
of scientists. Figure 10a shows the largest component
of a network of collaborations between physicists
who conduct research on networks. (The authors
of the present paper, for instance, are among the nodes 
in this network.) This network (which appeared previ- 
ously in Ref. l3a) was constructed by taking names of 
authors appearing in the lengthy bibliography of Ref. 
and cross-referencing with the Physics E-print Archive at 
arxiv.org, specifically the condensed matter section of
the archive where, for historical reasons, most papers on 
networks have appeared. Authors appearing in both were 
added to the network as vertices, and edges between them 
indicate coauthorship of one or more papers appearing 
in the archive. Thus the collaborative ties represented in 
the figure are not limited to papers on topics concerning 
networks — we were interested primarily in whether peo- 
ple know one another, and collaboration on any topic is 
a reasonable indicator of acquaintance. 

The network as presented in Fig. 10a is difficult to interpret.
Given the names of the scientists, a knowledgeable
reader with too much time on their hands could, no
doubt, pick out known groupings, for instance at partic- 
ular institutions, from the general confusion. But were 
this a network about which we had no a priori knowledge, 
we would be hard pressed to understand its underlying 
structure. 

Applying the shortest-path version of our algorithm 
to this network we find that the modularity, Eq. (5),
has a strong peak at 13 communities with a value of
Q = 0.72 ± 0.02. Extracting the communities from the
corresponding dendrogram, we have indicated them with
colors in Fig. 10b. The knowledgeable reader will again
be able to discern known groups of scientists in this rendering,
and more easily now with the help of the colors.
Still, however, the structure of the network as a whole
and of the interactions between groups is quite unclear.

FIG. 9: Community structure in the karate club network. Left: the dendrogram extracted by the shortest-path betweenness
version of our method, and the resulting modularity. The modularity has two maxima (dotted lines) corresponding to splits
into two communities (which match closely the real-world split of the club, as denoted by the shapes of the vertices) and five
communities (though one of those five contains only one individual). Only one individual, number 3, is incorrectly classified.
Center: the dendrogram for the random walk version of our method. This version classifies all 34 vertices correctly into the
factions that they actually split into (first dotted line), although the split into four communities gets a higher modularity score
(second dotted line). Right: the dendrogram for the shortest-path algorithm without recalculation of betweennesses after each
edge removal. This version of the calculation fails to find the split into the two factions.

In Fig. 10c we have reduced the network to only the
groups. In this panel, we have drawn each group as a
circle, with size varying roughly with the number of individuals
in the group. The lines between groups indicate
collaborations between group members, with the thickness
of the lines varying in proportion to the number of
pairs of scientists who have collaborated. Now the overall
structure of the network becomes easy to see. The
network is centered around the middle group shown in
cyan (which consists of researchers primarily in southern
Europe), with a knot of inter-community collaborations
going on between the groups on the lower right of the
picture (mostly Boston University physicists and their
intellectual descendants). Other groups (including the
authors' own) are arranged in various attitudes further
out.
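
The reduction of a network to its groups is straightforward to express in code. The following Python sketch is illustrative only (the function name and input format are assumptions): it returns the number of individuals in each group, which would set the circle sizes, and the number of collaborating pairs between each pair of groups, which would set the line thicknesses.

```python
from collections import Counter


def coarse_grain(edges, community):
    """Collapse each community to a single node.

    Returns (sizes, weights): sizes[c] is the number of vertices in community c,
    and weights[(c, d)] is the number of distinct vertex pairs joined by an edge
    with one end in community c and the other in community d.
    """
    sizes = Counter(community.values())
    weights = Counter()
    # Treat (u, v) and (v, u) as the same pair and ignore repeated edges.
    pairs = {tuple(sorted(e)) for e in edges}
    for u, v in pairs:
        cu, cv = community[u], community[v]
        if cu != cv:
            weights[tuple(sorted((cu, cv)))] += 1
    return sizes, weights


# Toy example: two groups of three joined by a single collaboration.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
sizes, weights = coarse_grain(edges, community)
print(dict(sizes), dict(weights))
```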

One of the problems created by the sudden availability 
in recent years of large network data sets has been our 
lack of tools for visualizing their structure. In the
early days of network analysis, particularly in the social 
sciences, it was usually enough simply to draw a picture 
of a network to see what was going on. Networks in those 
days had ten or twenty nodes, not 140 as here, or several 
billion as in the world wide web. We believe that methods 
like the one presented here, of using community structure 
algorithms to make a meaningful "coarse graining" of a 
network, thereby reducing its level of complexity to one 
that can be interpreted readily by the human eye, will 
be invaluable in helping us to understand the large-scale 
structure of these new network data. 



D. Other examples 

In this section, we briefly describe example applica- 
tions of our methods to three further networks. The first 
is a non-human social network, a network of dolphins, the 
second a network of fictional characters, and the third not 
a social network at all, but a network of web pages and 
the links between them. 

In Fig. 11 we show the social network of a community
of 62 bottlenose dolphins living in Doubtful Sound, New
Zealand. The network was compiled by Lusseau [37] from
seven years of field studies of the dolphins, with ties between
dolphin pairs being established by observation of
statistically significant frequent association. The network
splits naturally into two large groups, represented by the
circles and squares in the figure, and the larger of the two
also splits into four smaller subgroups, represented by the
different shades. The modularity is Q = 0.38 ± 0.08 for
the split into two groups, and peaks at 0.52 ± 0.03 when
the subgroup splitting is included also.

FIG. 10: Illustration of the use of the community structure algorithm to make sense of a complex network. (a) The initial
network is a network of coauthorships between physicists who have published on topics related to networks. The figure shows
only the largest component of the network, which contains 145 scientists. There are 90 more scientists in smaller components,
which are not shown. (b) Application of the shortest-path betweenness version of the community structure algorithm produces
the communities shown by the colors. (c) A coarse-graining of the network in which each community is represented by a single
node, with edges representing collaborations between communities. The thickness of the edges is proportional to the number
of pairs of collaborators between communities. Clearly panel (c) reveals much that is not easily seen in the original network of
panel (a).

FIG. 11: Community structure in the bottlenose dolphins of
Doubtful Sound [37, 38], extracted using the shortest-path
version of our algorithm. The squares and circles denote the
primary split of the network into two groups, and the circles
are further subdivided into four smaller groups denoted by
the different shades of vertices. The modularity for the split
is Q = 0.52. The network has been drawn with longer edges
between vertices in different communities than between those
in the same community, to make the community groupings
clearer. The same is also true of Figs. 12 and 13.

The split into two groups appears to correspond to a 
known division of the dolphin community [38]. Lusseau
reports that for a period of about two years during ob- 
servation of the dolphins they separated into two groups 
along the lines found by our analysis, apparently because 
of the disappearance of individuals on the boundary be- 
tween the groups. When some of these individuals later 
reappeared, the two halves of the network joined together 
once more. As Lusseau points out, developments of this 
kind illustrate that the dolphin network is not merely 
a scientific curiosity but, like human social networks, is 
closely tied to the evolution of the community. The sub- 
groupings within the larger half of the network also seem 
to correspond to real divisions among the animals: the 
largest subgroup consists almost entirely of females
and the others almost entirely of males, and it is conjec- 
tured that the split between the male groups is governed 
by matrilineage (D. Lusseau, personal communication). 

Figure 12 shows the community structure of the network
of interactions between major characters in Victor
Hugo's sprawling novel of crime and redemption in
post-restoration France, Les Miserables. Using the list of
character appearances by scene compiled by Knuth [39], a
network was constructed in which the vertices represent
characters and an edge between two vertices represents
co-appearance of the corresponding characters in one or
more scenes. The optimal community split of the resulting
graph has a strong modularity of Q = 0.54 ± 0.02,
and gives 11 communities as shown in the figure. The 
communities clearly reflect the subplot structure of the 
book: unsurprisingly, the protagonist Jean Valjean and 
his nemesis, the police officer Javert, are central to the 
network and form the hubs of communities composed of 
their respective adherents. Other subplots centered on 
Marius, Cosette, Fantine, and the bishop Myriel are also 
picked out. 

Finally, as an example of the application of our method 
to a non-social network, we have looked at a web graph — 
a network in which the vertices and edges represent web 
pages and the links between them. The graph in question 
represents 180 pages from the web site of a large corporation [51].
Figure 13 shows the network and the communities
found in it by the shortest-path version of our algorithm.
This network has one of the strongest modularity
values of the examples studied here, at Q = 0.65 ± 0.02.
The links between web pages are directed, as indicated by
the arrows in the figure, but, as discussed in Sec. III A,
for the purposes of finding the communities we ignore 
direction and treat the network as undirected. 

Certainly it might be useful to know the communities 
in a web network; an algorithm that can pick out com- 
munities could reveal which pages cover related topics or 
the social structure of links between pages maintained by 
different individuals. Ideas along these lines have been 
pursued by, for example, Flake et al. [40] and Adamic
and Adar [41].



VI. CONCLUSIONS 

In this paper we have described a new class of algo- 
rithms for performing network clustering, the task of ex- 
tracting the natural community structure from networks 
of vertices and edges. This is a problem long studied in 
computer science, applied mathematics, and the social 
sciences, but it has lacked a satisfactory solution. We 
believe the methods described here give such a solution. 
They are simple, intuitive, and demonstrably give excel- 
lent results on networks for which we know the commu- 
nity structure ahead of time. Our methods are defined 
by two crucial features. First, we use a "divisive" tech- 
nique which iteratively removes edges from the network, 
thereby breaking it up into communities. The edges to be
removed are identified by using one of a set of edge be- 
tweenness measures, of which the simplest is a generaliza- 
tion to edges of the standard shortest-path betweenness 
of Freeman. Second, our algorithms include a recalcu- 
lation step in which betweenness scores are re-evaluated 
after the removal of every edge. This step, which was 
missing from previous algorithms, turns out to be of pri- 
mary importance to the success of ours. Without it, the 
algorithms fail miserably at even the simplest clustering 
tasks. 
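
The two features can be seen together in the following compact Python sketch of the shortest-path version of the method. It is an illustration written for this discussion, not the implementation behind the results above: the function names are hypothetical, the graph is assumed to be undirected, unweighted, and free of self-loops, and the betweenness routine is a straightforward breadth-first-search accumulation that counts every vertex pair from both ends, which rescales all scores by the same factor and so does not change which edge is removed.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge; adj maps vertex -> set of neighbours."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, paths, order = {s: 0}, {s: 1.0}, []
        queue = deque([s])
        while queue:                       # breadth-first search from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    paths[v] = 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]   # number of shortest paths reaching v
        flow = {v: 1.0 for v in order}
        for v in reversed(order):          # work back from the farthest vertices
            for u in adj[v]:
                if dist[u] == dist[v] - 1:
                    share = flow[v] * paths[u] / paths[v]
                    score[frozenset((u, v))] += share
                    flow[u] += share
    return score


def components(adj):
    """Connected components of the graph as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in part:
                    part.add(v)
                    stack.append(v)
        seen |= part
        parts.append(part)
    return parts


def divisive_communities(edges):
    """Successive divisions of the network, from one piece down to single vertices."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    levels = [components(adj)]
    while any(adj.values()):
        score = edge_betweenness(adj)      # recalculated after every removal
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > len(levels[-1]):   # a community has just split
            levels.append(parts)
    return levels


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
levels = divisive_communities(edges)
print(levels[1])  # first split: the two triangles
```

The list of divisions returned here corresponds to the levels of the dendrogram, so a modularity value can be computed for each entry to decide where to cut.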

We have demonstrated the efficacy and utility of our
methods with a number of examples. We have shown
that our algorithms can reliably and sensitively extract
community structure from artificially generated networks
with known communities. We have also applied them
to real-world networks with known community structure
and again they extract that structure without difficulty.
And we have given examples of how our algorithms can
be used to analyze networks whose structure is otherwise
difficult to comprehend. The networks studied include a
collaboration network of scientists, in which our methods
allow us to generate schematic depictions of the overall
structure of the network and collaborations taking place
within and between communities, other social networks
of people and of animals, and a network of links between
pages on a corporate web site.

FIG. 12: The network of interactions between major characters in the novel Les Miserables by Victor Hugo. The greatest
modularity achieved in the shortest-path version of our algorithm is Q = 0.54 and corresponds to the 11 communities represented
by the colors.

The primary remaining difficulty with our algorithms 
is the relatively high computational demands they make. 
The fastest of them, the one based on shortest-path
betweenness, operates in $O(n^3)$ time on a sparse graph,
which makes it usable for networks up to about 10 000 
vertices, but for larger systems it becomes intractable. 
Although the ever-improving speed of computers will cer- 
tainly raise this limit in coming years, it would be more 
satisfactory if a faster version of the method could be
discovered. One possibility is parallelization: the
betweenness calculation involves a sum over source vertices and
the elements of that sum can be distributed over different 
processors, making the calculation trivially parallelizable 
on a distributed-memory machine. However, a better 
approach would be to find some improvement in the al- 
gorithm itself to decrease its computational complexity. 

Since the publication of our first paper on this
topic [25], several other authors have made use of the
shortest-path version of our algorithm. Holme et al. [42]
have applied it to a number of metabolic networks for
different organisms, finding communities that correspond
to functional units within the networks, while Wilkinson
and Huberman [43] have applied it to a network of
relations between genes, as established by co-occurrence
of names of genes in published research articles. An
interesting application to social networks is the study by
Gleiser and Danon [44] of the collaboration network of
early jazz musicians. They found, among other things,
that the network split into two communities along lines of
race, black musicians in one group, white musicians in the
other. Guimera et al. [45] have applied the method to a
network of email messages passing between users at a
university, and found communities that reflect both formal
and informal levels of organization. Tyler et al. [46] have
also applied the algorithm to an email network, in their
case at a large company, finding that the resulting communities
correspond closely to organizational units. The
latter work is interesting also in that it suggests a method
for improving the speed of the algorithm: Tyler et al. calculate
betweenness for only a subset, randomly chosen,
of possible source vertices in the network, rather than
summing over all sources. The size of the subset is decided
on the fly, by sampling source vertices until the
betweenness of at least one edge in the network exceeds
a predetermined threshold. This technique reduces the
running time of the calculation considerably, although
the resulting estimate of betweenness necessarily suffers
from the statistical fluctuations inherent in random sampling
methods. This idea, or a variation of it, might
provide a solution to the problems mentioned above of
the high computational demands of our algorithms.

We are of course delighted to see our methods applied
to such a variety of problems. Combined with the new
algorithms and measures described in this paper, we hope
to see many more applications in the future.

FIG. 13: Pages on a web site and the hyperlinks between
them. The colors denote the optimal division into communities
found by the shortest-path version of our algorithm.